Bulk RNA-seq analysis of mice model melanoma

Material and methods

Quality control of experimental mice transcriptomics data was performed using FastQC (v0.12.1). MultiQC tool was used for aggregation of FastQC reports [Ewels et al., 2016]. The fastp (v0.23.4) was used to filter out low-quality reads [Chen et al., 2018]. Read alignment was performed by STAR (v2.7.11) [Dobin et al., 2013] using the Mus musculus reference genome downloaded from GENCODE GRCm39.vM36 and the gtf file gencode.vM36.chr_patch_hapl_scaff.annotation.gtf. Count of gene expression profiles was performed using HTSeq count (v2.0.5) [Anders et al., 2015]. Differential expression analysis between experimental groups was performed using DESeq2 (v1.44.0) [Love et al., 2014]. Gene-set enrichment analysis (GSEA) was performed using fgsea (v1.31.6) [Korotkevich et al., 2016] and clusterProfiler (v4.12.6) [Wu et al., 2021] approaches using the following databases: MSigDB [Castanza et al., 2023]. Mapping gene identifiers between different coding systems was performed by biomaRt package. To perform the GSEA analysis using immune cells markers, a CellMarker 2.0 database were used [Hu et al., 2023]. Gene set variation analysis GSVA was used for estimating immune cell type enrichment within samples [Hänzelmann et al., 2013]. Co-expression analysis was performed using BioNERO (v1.12.0) [Almeida-Silva et al., 2022] with signed hybrid network type and Pearson correlation. Correction of expression data by differentiation method batch was performed using removeBatchEffect function from limma package [Ritchie et al., 2015]. Over-representation analysis for functional analysis of co-expressed gene modules linked to experimental groups using clusterProfiler package with MSigDB, KEGG [Kanehisa et al., 2025], Reactome [Milacic et al., 2024] and WikiPathways [Agrawal et al., 2024]. STRING database and STRINGdb Bioconductor package were used for construction of interaction networks linked to genes included to co-expressed gene modules [Szklarczyk et al., 2023]. For visualization of obtained results pheatmap (v1.0.12), ggplot2 (v3.5.1) and EnhancedVolcano packages were used. Data analysis were implemented in R (v4.4.2).

flowchart LR
    A[("Raw reads")] --> B{"fastp"}
    B --> C{"FastQC"} & E{"STAR"}
    C --> D{"MultiQC"}
    E --> F{"HTSeq"}
    F --> G[("Count table")]
    G --> I{"DESeq2"} & Z{"limma"} & W{"PCA"}
    Z --> H{"PCA"} & J{"BioNERO"}
    I --> K{"GSEA"}
    J --> L{"ORA"} & N{"STRING"}
    G --> V{"GSVA"}

     A:::Aqua
     B:::Aqua
     B:::Pine
     C:::Pine
     E:::Pine
     D:::Pine
     F:::Pine
     G:::Aqua
     I:::Pine
     Z:::Pine
     W:::Pine
     H:::Pine
     J:::Pine
     K:::Pine
     L:::Pine
     N:::Pine
     V:::Pine
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    classDef Pine stroke-width:1px, stroke-dasharray:none, stroke:#254336, fill:#27654A, color:#FFFFFF

References

  1. Ewels, Philip, et al. “MultiQC: summarize analysis results for multiple tools and samples in a single report.” Bioinformatics 32.19 (2016): 3047-3048.

  2. Chen, Shifu, et al. “fastp: an ultra-fast all-in-one FASTQ preprocessor.” Bioinformatics 34.17 (2018): i884-i890.

  3. Dobin, Alexander, et al. “STAR: ultrafast universal RNA-seq aligner.” Bioinformatics 29.1 (2013): 15-21.

  4. Anders, Simon, Paul Theodor Pyl, and Wolfgang Huber. “HTSeq—a Python framework to work with high-throughput sequencing data.” Bioinformatics 31.2 (2015): 166-169

  5. Love, Michael I., Wolfgang Huber, and Simon Anders. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome biology 15 (2014): 1-21.

  6. Hu, Congxue, et al. “CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data.” Nucleic acids research 51.D1 (2023): D870-D876.

  7. Hänzelmann, Sonja, Robert Castelo, and Justin Guinney. “GSVA: gene set variation analysis for microarray and RNA-seq data.” BMC bioinformatics 14 (2013): 1-15.

  8. Almeida-Silva, Fabricio, and Thiago M. Venancio. “BioNERO: an all-in-one R/Bioconductor package for comprehensive and easy biological network reconstruction.” Functional & Integrative Genomics 22.1 (2022): 131-136.

  9. Ritchie, Matthew E., et al. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic acids research 43.7 (2015): e47-e47.

  10. Kanehisa, Minoru, et al. “KEGG: biological systems database as a model of the real world.” Nucleic Acids Research 53.D1 (2025): D672-D677.

  11. Milacic, Marija, et al. “The reactome pathway knowledgebase 2024.” Nucleic acids research 52.D1 (2024): D672-D678.

  12. Agrawal, Ayushi, et al. “WikiPathways 2024: next generation pathway database.” Nucleic acids research 52.D1 (2024): D679-D689.

  13. Szklarczyk, Damian, et al. “The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest.” Nucleic acids research 51.D1 (2023): D638-D646.

Results

Samples and experimental groups

A total of 14 transcriptomic samples corresponding to the following three experimental groups: melanoma (n=6) and melanoma with Bifidobacterium adolescentis 150 supplement (n=6) and melanoma with Lacticaseibacillus rhamnosus K32 supplement (n=2) were added to the analysis (see Table 1). The experiment was performed in two replicates: first included only melanoma and melanoma with B. adolescentis supplement groups; second included all three experimental groups. To simplify the description of the results, we adopted the following codings:

Experimental groups:

  1. Melanoma –> M

  2. Melanoma with B. adolescentis supplement –> M_BIF

  3. Melanoma with L. rhamnosus supplement –> M_LAC

Experimental batch:

  1. First experiment –> batch_1

  2. Second experiment –> batch_2

Quality control

After quality filtering using fastp, MultiQC was used for aggregate FastQC reports. Figure 1 shows the summary MultiQC quality assessment profile which shows that the sequencing data have a average quality. In summary, after quality filtering, ~ 198 million reads (14 \(\pm\) 5 per sample) added to further analysis. In average 16 \(\pm\) 7% per sample were filtered. Quality filtering statistic showed in Table 2. HTSeq tool were used for gene expression statistic counting. Figure 3 showed HTSeq counts across experimental samples. No statistically significant differences in the number of reads were found between the experimental groups. However, different batches statistically significantly differed in the read counts (see Figure 3). The gene expression counts table normalized by median of ratios method is presented in Table 4.

MiltiQC report

MultiQC full report

Figure 1. Summary MultiQC report.

Reads filtering by quality statistic

Mapping reads to reference counts

Saving 5 x 7 in image

Figure 3. Barplot denoted HTSeq counts per experimantal sample.

ANOVA

            Df Sum Sq Mean Sq F value   Pr(>F)    
batch        1  851.8   851.8 210.230 4.84e-08 ***
group        2    0.1     0.0   0.006    0.994    
Residuals   10   40.5     4.1                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Normalized counts table

Principal component analysis

Principal component analysis (PCA) is a linear dimensional reduction technique with applications in exploratory data analysis and visualization. Before performing a PCA, expression data were transformed by variance-stabilizing transformation. This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homeostatic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size. The bi-dimensional visualization using the first two components presented in Figure 4. Proportion of explained variance across principal components presented in Figure 5. According to analysis of variance using permutations (PERMANOVA) expression profiles statistical significant links to batch but not linked to experimental groups.

Saving 5 x 4 in image

Figure 4. PCA visualization using regularized log transformation counts.
Saving 5 x 4 in image

Figure 5. The proportion of explained variance of each component.
Permutation test for adonis under reduced model
Marginal effects of terms
Permutation: free
Number of permutations: 9999

adonis2(formula = t(vsd) ~ batch + group, data = meta_data, permutations = 9999, method = "euclidean", by = "margin")
         Df SumOfSqs      R2      F Pr(>F)   
batch     1     9421 0.19069 3.7386 0.0086 **
group     2     7784 0.15756 1.5445 0.1325   
Residual 10    25199 0.51007                 
Total    13    49404 1.00000                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Differential expression analysis

Differential expression analysis was performed using DESeq2 package with formula ~ batch + group for controlling dispersion linked to batch. Positive lfc and NES values in differential expression and GSEA analysis linked to B. adolescentis 150 and L. rhamnosus K32 supplement groups as well as negative values to melanoma group without any interventions. The DESeq2 package allows to estimate differential gene expression by constructing generalized linear models for the negative binomial distribution and analyzing linear contrasts between compared groups. According to obtained results 3 genes were up-regulated as well as 16 down-regulated in M_BIF group in comparison with M group. At the same time, 21 genes were linked to M_LAC groups while 0 genes linked to M group in comparison to M_LAC. When comparing bacteria supplement groups among themselves, 39 genes were linked to M_BIF whereas 116 genes were linked to M_LAC. Differentially represented genes were identified using the following significance thresholds adj. p < 0.05, |lfc| > 1. Differential analysis statistic presented in Tables 4-6. Visualization of differential expression genes of p-value - log FC measure presented via volcano plots (see Figures 6-8).

M_BIF vs M

Figure 6. Volcano plot denoted differential expressed genes: M_BIF vs M.

M_LAC vs M

Figure 7. Volcano plot denoted differential expressed genes: M_LAC vs M.

M_BIF vs M_LAC

Figure 8. Volcano plot denoted differential expressed genes: M_BIF vs M_LAC.

Gene set enrichment analysis

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a prior defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotype). Several databases were used to conduct the GSEA analysis, which allowed us to identify a wide range of biological functions associated with experimental groups. MSigDB hallmark gene-set were used for this analysis. The obtained results presented in Figure 9 and Tables 7-9. To identify specific expression markers characteristic of different cell types, a Cell Marker 2.0 database for mouse was used (see Figure 10 and Tables 10-12).

MSigDb Hallmark

M_BIF vs M
M_LAC vs M
M_BIF vs M_LAC

Figure 9. Heatmap denoted NES values of MSigDB Hallmark pathways linked to experimental groups.

Cell Marker 2.0

M_BIF vs M
M_LAC vs M
M_BIF vs M_LAC

Figure 10. Heatmap denoted NES values of immune cell types linked to experimental groups.

Prediction of immune cell counts

Mmcp counter (T cell)

Figure 12. Boxplot showed Mmcp counter predicted T cell values in experimental groups.
            Df Sum Sq Mean Sq F value   Pr(>F)    
group        2 149.33   74.67   30.90 5.24e-05 ***
batch        1  54.00   54.00   22.34 0.000808 ***
Residuals   10  24.17    2.42                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GSVA

Macrophage

Figure 13. Boxplot showed GSVA predicted counts of M1 macrophage in experimental groups.
            Df Sum Sq Mean Sq F value   Pr(>F)    
group        2  63.67   31.83   6.221 0.017563 *  
batch        1 112.67  112.67  22.020 0.000851 ***
Residuals   10  51.17    5.12                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 14. Boxplot showed GSVA predicted counts of M2 macrophage in experimental groups.
            Df Sum Sq Mean Sq F value   Pr(>F)    
group        2  45.67   22.83   4.463 0.041192 *  
batch        1 130.67  130.67  25.537 0.000497 ***
Residuals   10  51.17    5.12                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 15. Boxplot showed GSVA predicted counts of M1/M2 macrophage ratio in experimental groups.
            Df Sum Sq Mean Sq F value   Pr(>F)    
group        2  63.67   31.83   6.221 0.017563 *  
batch        1 112.67  112.67  22.020 0.000851 ***
Residuals   10  51.17    5.12                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
CD8+

Figure 16. Boxplot showed GSVA predicted counts of CD8+ T cells in experimental groups.
            Df Sum Sq Mean Sq F value  Pr(>F)   
group        2  49.33   24.67   2.864 0.10390   
batch        1  92.04   92.04  10.687 0.00844 **
Residuals   10  86.12    8.61                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Co-expression analysis

Discussion